Red Wine Quality

I am analyzing the Red Wine Quality dataset provided by Udacity. The purpose is to determine whether any of the physicochemical properties distinguish excellent, good and poor quality wines.

Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

The dataset covers the red variant of the Portuguese “Vinho Verde” wine.

For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].

Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The dataset was created using red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

What marks a good quality wine?

Before starting my project, I took some time learning about basic wine characteristics. I learnt that the main fundamental traits of wine are sweetness, acidity, tannin, alcohol and body. In addition, I looked through the given red wine dataset to get a feel for the input and output variables.

Red Wine Dataset

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Input variables (based on physicochemical tests):

  1. – fixed acidity (tartaric acid - g / dm^3)
  2. – volatile acidity (acetic acid - g / dm^3)
  3. – citric acid (g / dm^3)
  4. – residual sugar (g / dm^3)
  5. – chlorides (sodium chloride - g / dm^3)
  6. – free sulfur dioxide (mg / dm^3)
  7. – total sulfur dioxide (mg / dm^3)
  8. – density (g / cm^3)
  9. – pH
  10. – sulphates (potassium sulphate - g / dm^3)
  11. – alcohol (% by volume)

Output variable (based on sensory data):

  1. – quality (score between 0 and 10)

I’d like to explore which input variables had an impact on the wine quality (output variable) ratings.

So let’s begin with the data exploration.

Univariate Plots Section

Red Wine Dataset Structure

Here is the structure of the red wine dataset.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

There are 1599 observations and 13 variables. ‘quality’ is an output variable. ‘X’ is an observation identifier. ‘quality’ and ‘X’ are integers. The rest of the variables are numeric.

Statistical summary of the red wine variables

Below are the summary statistics (minimum, 1st quartile, median, mean, 3rd quartile and maximum) for all the variables.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Quality

My main feature of interest is the quality variable. So, let’s look at the summary of the quality variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

From the summary above, the wine quality ranges from 3 to 8. The median value is 6. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). Let’s look at the quality variable on a bar chart.

The quality variable has a roughly normal shape over discrete integer values. The majority of the wines have a quality of 5 or 6. There are very few exceptionally excellent or poor quality wines. The minimum rating is 3 and the maximum rating is 8. It is clear that there are many more good wines than excellent or poor wines. In addition, not a single wine received a score of 0, 1, 2, 9 or 10.

How many wines are “Poor”, “Good” and “Excellent” ?

##      Poor      Good Excellent 
##        63      1319       217

There are 63 poor quality wines, 1319 good quality wines and 217 excellent quality wines.
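As a minimal sketch of how such a grouping could be built, the R snippet below rebuilds the quality vector from the frequency table above and bins it with `cut()`. The cut points (3-4 as Poor, 5-6 as Good, 7-8 as Excellent) are an assumption inferred from the counts; the report does not state them explicitly.

```r
# Rebuild the quality scores from the frequency table shown above.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Assumed cut points (inferred from the counts, not stated in the report):
# (0, 4] -> Poor, (4, 6] -> Good, (6, 10] -> Excellent.
ratings <- cut(quality,
               breaks = c(0, 4, 6, 10),
               labels = c("Poor", "Good", "Excellent"),
               ordered_result = TRUE)

# Count the wines in each rating group.
table(ratings)
```

With these cut points the counts match the table above, which suggests the bins are at least consistent with the ones used in this report.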

Quality and Ratings Bar Charts

The “Good” rating has far more wines than the “Poor” and “Excellent” ratings. I am surprised that none of the wines had a quality level higher than 8 or lower than 3. There were very few wines in the ‘Poor’ and ‘Excellent’ ratings.

Now that we have seen the nature of the red wine quality, let’s explore the physicochemical input variables.

Plot the input variables into histograms

For the single variable analysis, I am going to plot a series of histograms. These histograms will show the distribution of each input variable.

From the input variable histograms above, we can see some interesting distributions. Let’s look at some of these histograms.

I am particularly interested in the following input variables:

  1. Acidity – How tart is the wine?
    • Fixed Acidity
    • Volatile Acidity
    • Total Acidity
    • Citric Acid
  2. Residual Sugar – How sweet or dry (not sweet) is the wine?
  3. Alcohol – How much does the wine warm the throat?

Acidity – How tart is the wine?

Acidity is a fundamental property of wine, imparting sourness or tartness and resistance to microbial infection. The acidity of a wine helps to determine how the finished wine will taste, how it feels in the mouth and how well it will age. A low acidity wine will taste flat and boring, while too much acid can make the wine tart or sour.

Fixed Acidity

The predominant fixed acids found in wines are tartaric, malic, citric and succinic. All these acids originate in grapes with the exception of succinic acid, which is produced during the fermentation process. Grapes are one of the rare fruits that contain tartaric acid. Tartaric acid is one of the strongest acids in wine and controls the acidity of wine. It plays a critical role in the taste, feel and color of a wine. Even more important, it lowers the pH enough to kill undesirable bacteria, acting as a preservative.

Fixed Acidity Variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity is slightly skewed to the right, with a long tail extending out to a maximum of 15.90 g/dm^3. The median value is 7.90 g/dm^3 and the mean is 8.32 g/dm^3. The skew arises because a few wines have a very high fixed acidity.

Volatile Acidity Variable

Volatile acids are produced through microbial action such as yeast fermentation, malolactic fermentation and other fermentations carried out by spoilage organisms. The most prominent volatile acid in wine is acetic acid. Acetic acid bacteria require oxygen to grow; therefore, eliminating air from wine containers and adding sulfur dioxide will limit their growth. Our palates are quite sensitive to the presence of volatile acids, and for that reason winemakers try to keep their concentrations as low as possible.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity has a bimodal distribution with peaks around 0.4 and 0.6 g/dm^3. The median volatile acidity is 0.52 g/dm^3, and the mean is 0.5278 g/dm^3.

Total Acidity Variable

Total acidity of a wine is the combined sum of the fixed and volatile acids present. So what does total acidity tell us? When we have a glass of wine, our mouth is largely unable to tell the difference between fixed and volatile acids. If there is an overwhelming quantity of any single acid, say citric, we may be able to pick out its contribution to the wine. In the case of citric acid, the wine may have citrus overtones. An overabundance of a volatile spoilage acid can produce obvious flavors and aromas. Even so, that one acid must be out of balance for us to notice it.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.120   7.680   8.445   8.847   9.740  16.285

Total acidity distribution looks similar to fixed acidity distribution. This is not surprising because volatile acid numbers are much smaller than the fixed acidity. The median total acidity is 8.445 g/dm^3. There is an increase of 0.545 g/dm^3 over the median fixed acidity.
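The derived variable can be sketched in a few lines of R. The data frame below is a small illustrative sample holding only the first ten rows shown in the dataset structure output earlier, not the full dataset, and the object name `wine` is taken from the model calls later in this report.

```r
# First ten observations, copied from the str() output above (sample only).
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
)

# Total acidity is the sum of the fixed and volatile acidities (g/dm^3).
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

wine$total.acidity
```

Because volatile acidity is roughly an order of magnitude smaller than fixed acidity, the sum stays close to the fixed acidity value in every row.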

Citric Acid Variable

Most, if not all, of the citric acid naturally present in the grapes is consumed by bacteria during fermentation. The absence of citric acid would bring the fermentation process to a grinding halt, although this almost never happens.

Citric acid plays a major role in a winemaker’s influence on acidity. Many winemakers use citric acid to acidify wines that are too basic and as a flavor additive. This process has its benefits and drawbacks. Adding citric acid gives the wine a “freshness” otherwise not present and effectively makes the wine more acidic. The major drawback is that bacteria use citric acid in their metabolism, so the added citric acid may simply be consumed by bacteria, promoting the growth of unwanted microbes.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

The most common value was 0.00 (132 wines). The next most common value was 0.49 (68 wines). Some of the higher values do not occur in the data at all.

Residual Sugar Variable - How sweet or dry (not sweet) is the wine?

Residual sugar is the sugar that remains in a wine after fermentation completes. Often the very first impression of a wine is in its level of sweetness. The greater the amount of residual sugar, the sweeter the wine. Residual sugar is balanced by acidity, alcohol and tannins in wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

A log10 scale was applied to the residual sugar to get a better visualization of the distribution. Residual sugar had a median of 2.20 g/dm^3 with a long tail that extended out to 15.50 g/dm^3.

Chloride Variable

Chlorides is the amount of salt in wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

I transformed the long tail distribution with a log10 scale so it could be better visualized. After the transformation, the chlorides histogram appears normal, with some outliers on the right side and left side of the curve. Chlorides had a mean of 0.087 g/dm^3 and a median of 0.079 g/dm^3.
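To illustrate why the log10 scale helps, the sketch below compares the distance from the median to each extreme on the raw scale and on the log10 scale. It uses only the minimum, median and maximum from the summary above; the variable names are illustrative.

```r
# Minimum, median and maximum of chlorides, from the summary above (g/dm^3).
chlorides <- c(min = 0.012, median = 0.079, max = 0.611)

# Tail lengths on the raw scale: the upper tail is far longer.
raw_tails <- c(lower = chlorides[["median"]] - chlorides[["min"]],
               upper = chlorides[["max"]] - chlorides[["median"]])

# Tail lengths after a log10 transformation: the two tails are comparable.
log_chlorides <- log10(chlorides)
log_tails <- c(lower = log_chlorides[["median"]] - log_chlorides[["min"]],
               upper = log_chlorides[["max"]] - log_chlorides[["median"]])

raw_tails
log_tails
```

On the raw scale the upper tail is several times longer than the lower one, while on the log10 scale the two are nearly equal, which is consistent with the histogram looking roughly normal after the transformation.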

Free Sulfur Dioxide Variable

Free sulfur dioxide prevents microbial growth and the oxidation of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

There are more wines in the dataset with low levels of free sulfur dioxide than with high levels. On average, wines contain 15.87 mg/dm^3 of free sulfur dioxide.

Total Sulfur Dioxide Variable

Total sulfur dioxide is the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Similar to free sulfur dioxide, the distribution of total sulfur dioxide is positively skewed, with a few wines having extreme values. There are two large outliers in this dataset. The mean and median for total sulfur dioxide are 46.47 mg/dm^3 and 38.00 mg/dm^3 respectively.

Density Variable

The density of wine is close to that of water, depending on the percent alcohol and sugar content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density is one of the few normally distributed variables in this dataset. The median and mean are roughly the same (about 0.997 g/cm^3).

pH Variable

pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH has a normal distribution, with both the median and the mean at about pH 3.31.

Sulphates Variable

Sulphates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates is positively skewed with a few outliers. The average amount of sulphates is 0.66 g/dm^3. I applied a log10 transformation to get a better visualization of the sulphate distribution.

Alcohol Variable

Alcohol is the product of fermentation of the natural grape sugars by yeasts and without it wine simply doesn’t exist. The amount of sugar in the grapes determines what the final alcohol level will be. The conversion of sugar to alcohol is such a vital step in the process of making wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The average alcohol content for this dataset is around 10.42%. The distribution is skewed to the right, with a tail extending from the median of 10.2% up to the maximum of 14.9%.

Univariate Analysis

What is the structure of your dataset?

This tidy data set contains 1,599 red wines with 13 variables. Eleven input variables describe the chemical properties of the wine. There is also one identifier variable called “X” and one output variable, “quality”. “X” and “quality” are integers; all the other variables are numeric.

At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Other observations include:

  • Most of the wines have quality 5 or 6 on a scale of 0-10.
  • Average total acidity in wine was 8.847 g/dm^3
  • Mean alcohol amount is 10.42%
  • Average sugar amount is 2.54 g/dm^3 with the maximum 15.5 g/dm^3

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is quality. I’d like to know which input features are best for determining the wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Fixed acidity, volatile acidity, citric acid, residual sugar and alcohol likely contribute to the quality of wine. After doing extensive research, I think acidity and alcohol probably contribute most to the wine quality. This is based on the current univariate analysis and may change in the bivariate and multivariate analysis.

Did you create any new variables from existing variables in the dataset?

Yes, I created a couple of variables. First, a new variable called ‘ratings’ was created to reflect the quality ranges based on the quality variable. The ranges are in an ordered factor with levels “Poor”, “Good” and “Excellent”. Next, I created a variable called ‘total.acidity’ by summing fixed and volatile acidities.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The data was already in a tidy format, so there was no need for any additional formatting. I did remove the “X” variable from the data set, since it represented the row number and was not required for the analysis. Volatile acidity had a bimodal distribution with peaks around 0.40 and 0.60 g/dm^3. The citric acid distribution was unusual in that the most common value was 0.00. Some of the distributions were skewed to the right, so a logarithmic transformation was applied to make them easier to interpret.

Bivariate Plots Section

Let’s investigate the correlation between the variables and get a feel for potential relationships in the data.

Correlation Coefficient Diagram

I decided to use a correlation diagram to see the correlation coefficients between the input variables and the output variable. The main purpose is to give a high level snapshot of the relationships among these variables.

Here are the results of correlation coefficient between the quality variable and the input variables.

Input Variable          Pearson Correlation
fixed.acidity            0.12
volatile.acidity        -0.39
citric.acid              0.23
residual.sugar           0.01
chlorides               -0.13
free.sulfur.dioxide     -0.05
total.sulfur.dioxide    -0.19
density                 -0.17
pH                      -0.06
sulphates                0.25
alcohol                  0.48
total.acidity            0.09

No variable correlates strongly with quality, which surprised me. The strongest correlation with quality was alcohol (0.48), a moderate positive correlation. Sulphates (0.25), citric acid (0.23) and fixed acidity (0.12) had weak positive correlations, and residual sugar (0.01) was essentially uncorrelated. I was also surprised that sulphates were related to quality. Volatile acidity (-0.39), on the other hand, had the strongest negative correlation with quality.
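For readers who want to see how a Pearson coefficient is obtained, here is a minimal sketch that computes it by hand and checks it against R's built-in `cor()`. It uses only the first ten alcohol and quality values from the dataset structure output, so the result will not match the 0.48 reported for the full dataset; the helper name `pearson_r` is hypothetical.

```r
# First ten observations, copied from the str() output above (sample only).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson's r: the sum of cross-deviations divided by the square root of the
# product of the summed squared deviations (hypothetical helper).
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson_r(alcohol, quality)
r_builtin <- cor(alcohol, quality, method = "pearson")

# The two values should agree up to floating point error.
all.equal(r_manual, r_builtin)
```

On the full dataset the same call would be applied to each input column against quality to produce a table like the one above.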

Let’s look at the boxplots to see the relation between quality and the physicochemical variables. The following graphs show boxplots of each input variable for each quality level [3-8].

Compare each of the input variable with the quality variable

The box plots diagram description

  • The black dots represent outliers.
  • The black line inside the boxplot represents the median [50%].
  • The blue dot inside each boxplot, and the blue line connecting the dots, represent the mean for each quality level.
  • For a distribution that is positively skewed, the box plot will show the median (black line in the boxplot) closer to the lower or bottom quartile.
  • For a distribution that is negatively skewed, the box plot will show the median (black line in the boxplot) closer to the upper or top quartile.

Fixed Acidity vs Quality Bivariate Plot

The mean fixed acidity rises slightly from quality level 4 to 8, but the mean and median remain almost unchanged as quality increases. Fixed acidity therefore has almost no effect on quality, consistent with its weak positive correlation.

Volatile Acidity vs Quality Bivariate Plot

The mean decreases from quality level 3 to 7, and increases a little at 8. Volatile acidity seems to have a negative relationship with quality: there is a definite trend toward lower volatile acidity levels as wine quality increases. We know from the background information that high levels of volatile acidity can cause the wine to taste like vinegar, so this inverse relationship between volatile acidity and quality makes sense.

Total Acidity vs Quality Bivariate Plot

The combined sum of fixed and volatile acids gives the total acidity of a wine, a trait often considered essential for quality. The mean increases slightly from quality level 4 to 8, following a pattern similar to fixed acidity. However, the mean and median values of total acidity remain almost unchanged as quality increases, so total acidity has almost no effect on quality and only a weak positive correlation with it. This is quite surprising.

Citric Acid vs Quality Bivariate Plot

The mean remains the same from 3 to 4 but then increases from 5 to 8. Citric acid seems to have a positive correlation with quality: the higher the citric acid, the better the wine quality.

Residual Sugar vs Quality Bivariate Plot

Residual sugar is the amount of sugar remaining after fermentation stops. The mean residual sugar is almost the same at every quality level, so residual sugar has almost no correlation with quality. In other words, the sweetness of the wine is about the same across all quality levels.

Chlorides vs Quality Bivariate Plot

The mean drops sharply from quality level 3 to 4, then decreases slowly all the way to 8. Even though the correlation is weak and negative, the box plots suggest that the lower the amount of chlorides, the better the quality of the wine.

Free Sulfur Dioxide vs Quality Bivariate Plot

The mean increases from 3 to 5 and then gradually decreases from 5 to 8. Lower concentrations of free sulfur dioxide seem to be more prevalent in poor and excellent wines, while higher concentrations are found in good quality wines. Free sulfur dioxide has a very weak negative correlation with quality.

Total Sulfur Dioxide vs Quality Bivariate Plot

Total sulfur dioxide is the amount of free and bound forms of SO2. The mean increases from 3 to 5 and then decreases from 5 to 8, the same pattern as free sulfur dioxide. Low concentrations of total sulfur dioxide seem to be prevalent in poor and excellent wines, while higher concentrations are found in good wines. Total sulfur dioxide has a very weak negative correlation with quality.

Density vs Quality Bivariate Plot

The mean density decreases from quality level 3 to 4 and then increases from 4 to 5. However, it decreases from 5 to 8. Lower densities seem to be associated with better wines.

pH vs Quality Bivariate Plot

The mean remains the same between quality levels 3 and 4. It decreases from 4 to 5, increases slightly from 5 to 6, and then slowly decreases from 6 to 8. Better wines seem to have a lower pH. The lower the pH, the more acidic the wine.

Sulphates vs Quality Bivariate Plot

The mean steadily increases from 3 to 8. Better quality wines have a stronger concentration of sulphates.

Alcohol vs Quality Bivariate Plot

The mean increases from 3 to 4 and then decreases from 4 to 5. However, it increases markedly from 5 to 8. It is very clear that better quality wines have a higher alcohol content.

Other Relationships:

Acidity and pH

Comparing the relationship between pH with total acidity.

From the plot above there is a negative linear relationship between total acidity and pH.

Citric Acid and pH

Comparing the relationship between pH and citric acid.

From the plot above there is a negative linear relationship between citric acid and pH. Both total acidity and citric acid are strongly correlated with pH. The lower the pH, the more intense the acidity of the wine.

Alcohol and Residual Sugar

Residual sugar is the amount of sugar remaining after fermentation stops, that is, sugar that was not converted into alcohol during fermentation. It is clear from the scatter plot above that most of the wines contain some residual sugar regardless of their alcohol content. In dry wine, yeasts consume almost all of the sugar from the grapes. In sweet wine, the yeasts are killed before all the sugar is used, leaving behind residual sugars. However, even wines that taste very dry will have some degree of residual sugar.

Total Acidity and Density

Density and total acidity have a positive linear relationship. If the wine has fixed acids that don’t evaporate readily, then the wine is more dense.

Density and Alcohol

The relationship between alcohol and density is negative. Alcohol is lighter than water, so density decreases as alcohol increases. Fermentation is a natural process that transforms grape juice (the must) into wine. During fermentation, the sugar in the juice is converted into alcohol, and the density of the must progressively diminishes until it reaches a value from 0.990 to 0.995. Values greater than 1 indicate the presence of sugar, because dissolved sugar raises the density while alcohol lowers it. The more sugar is consumed by the yeast, the more alcohol we get. The density of wine is primarily determined by the concentration of alcohol, sugar, glycerol, and other dissolved solids. Sweeter wines generally have higher densities.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Based on my research about wine, acidity, residual sugar, alcohol, tannin and body play a crucial role in wine quality. Hence, in the univariate analysis, I mainly focused on fixed acidity, volatile acidity, total acidity, citric acid, residual sugar and alcohol. However, in the bivariate analysis, I decided to analyze all the input variables against quality. There were some interesting discoveries. They are as follows:

  1. Fixed acidity, total acidity and residual sugar hardly had any impact on quality.
  2. Alcohol had the strongest correlation and clear positive relationship with quality.
  3. Volatile acidity had an inverse relationship with quality. This is as expected: high levels of volatile acids lead to vinegar-like flavors, so the lower the volatile acidity, the better the taste of the wine.
  4. Sulphates had a positive relationship with quality. Sulphates tend to act as an antimicrobial agent and an antioxidant agent. As an antimicrobial agent, it regulates the growth of harmful yeast and bacterial growth in the wine. As an antioxidant, it guards against browning and protects the fruit-like qualities of the wine.
  5. Citric acid also had a positive relationship with quality. Citric acid is used to increase acidity and add fresh sensation to the wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes, I did observe some interesting relationships between the other input variables. Here are some of my observations:

  1. Total acidity and pH had a strong correlation of about -0.67. The higher the total acidity, the lower the pH. This result was not surprising at all, since pH is a direct measure of acidity. Malolactic fermentation, in which tart-tasting malic acid is converted to softer-tasting lactic acid, likewise raises the pH as it lowers the acidity.

  2. Residual sugar and alcohol had a weak positive correlation. During fermentation, yeasts metabolize sugars for energy, yielding alcohol as a major byproduct. In dry wine, yeasts consume almost all of the sugar from the grapes. In sweet wine, the yeasts are killed before all the sugar is used, leaving behind residual sugars. Regardless of the level of alcohol content, most wines will have some residual sugar.

  3. Density and total acidity had a strong correlation of about 0.68. If the wine has fixed acids that don’t evaporate readily, then the wine is more dense. The acids present in the wine have densities greater than water.

  4. Density and alcohol had an inverse relationship. Alcohol is lighter than water. Hence, density decreases with increased alcohol. During fermentation, the density of the must progressively diminishes, as alcohol is generated.

What was the strongest relationship you found?

Features with the strongest relationships to wine quality are as follows:

  1. Alcohol (Correlation: 0.48)
  2. Volatile Acidity (Correlation: -0.39)
  3. Sulphates (Correlation: 0.25)
  4. Citric Acids (Correlation: 0.23)

Some of the correlations between other features showed strong relationships. They were between total acidity and density (0.68), followed by pH and fixed acidity (-0.68), pH and total acidity (-0.67), fixed acidity and citric acid (0.67), density and fixed acidity (0.67) and free sulfur dioxide and total sulfur dioxide (0.67).

Multivariate Plots Section

Let’s do some multivariate plots of the following combinations of input variables with ratings quality:

  • alcohol
  • volatile acidity
  • sulphates
  • citric acid

We will use the ratings variable (Poor, Good & Excellent) created in the beginning of this report.

Citric Acid, Volatile Acidity and Quality Ratings

Excellent wines (red) tend to combine low volatile acidity with high citric acid.

Sulphates, Volatile Acidity and Quality Ratings

Excellent wines (red) tend to combine low volatile acidity with high sulphates.

Alcohol, Volatile Acidity and Quality Ratings

In the plot above, excellent wines (red) have low volatile acidity (y-axis) and high alcohol content (x-axis).

Sulphates, Citric Acid and Quality Ratings

Excellent wines (red) tend to combine high citric acid with high sulphates.

Alcohol, Citric Acid and Quality Ratings

Excellent wines (red) tend to have both high citric acid and high alcohol content.

Alcohol, Sulphates and Quality Ratings

Excellent wines (red) tend to combine high sulphates with high alcohol by volume.

Alcohol, Density and Quality Ratings

As density decreases, alcohol increases. Excellent wines (red) tend to sit at the low density, high alcohol end of the plot.

Citric Acid, Fixed Acidity and Quality Ratings

As fixed acidity and citric acid increase together, the quality of the wine tends to rise. The high quality wines are shown in red.

Linear Regression Model

For this linear regression model, I used variables that had some good correlations with quality. They were alcohol, volatile acidity, sulphates, chlorides, pH, citric acid and density.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wine)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = wine)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates), 
##     data = wine)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) + 
##     log10(chlorides), data = wine)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) + 
##     log10(chlorides) + pH, data = wine)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) + 
##     log10(chlorides) + pH + citric.acid, data = wine)
## m7: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) + 
##     log10(chlorides) + pH + citric.acid + density, data = wine)
## 
## ======================================================================================================================
##                          m1            m2            m3            m4            m5            m6            m7       
## ----------------------------------------------------------------------------------------------------------------------
##   (Intercept)           1.875***      3.095***      3.369***      3.069***      4.225***      4.842***     -6.479     
##                        (0.175)       (0.184)       (0.184)       (0.199)       (0.360)       (0.449)      (11.909)    
##   I(alcohol)            0.361***      0.314***      0.303***      0.282***      0.295***      0.302***      0.312***  
##                        (0.017)       (0.016)       (0.016)       (0.017)       (0.017)       (0.017)       (0.020)    
##   volatile.acidity                   -1.384***     -1.156***     -1.100***     -0.987***     -1.110***     -1.137***  
##                                      (0.095)       (0.097)       (0.098)       (0.102)       (0.115)       (0.118)    
##   log10(sulphates)                                  1.477***      1.713***      1.690***      1.742***      1.715***  
##                                                    (0.177)       (0.187)       (0.186)       (0.187)       (0.189)    
##   log10(chlorides)                                               -0.491***     -0.612***     -0.564***     -0.573***  
##                                                                  (0.128)       (0.131)       (0.132)       (0.133)    
##   pH                                                                           -0.448***     -0.595***     -0.598***  
##                                                                                (0.116)       (0.133)       (0.133)    
##   citric.acid                                                                                -0.276*       -0.331*    
##                                                                                              (0.121)       (0.134)    
##   density                                                                                                  11.274     
##                                                                                                           (11.852)    
## ----------------------------------------------------------------------------------------------------------------------
##   R-squared             0.227         0.317         0.345         0.352         0.357         0.360         0.360     
##   adj. R-squared        0.226         0.316         0.344         0.350         0.355         0.357         0.357     
##   sigma                 0.710         0.668         0.654         0.651         0.648         0.647         0.647     
##   F                   468.267       370.379       280.646       216.010       177.270       148.984       127.822     
##   p                     0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1621.814     -1587.752     -1580.357     -1572.954     -1570.342     -1569.887     
##   Deviance            805.870       711.796       682.108       675.828       669.599       667.415       667.035     
##   AIC                3448.114      3251.628      3185.503      3172.714      3159.908      3156.683      3157.774     
##   BIC                3464.245      3273.136      3212.389      3204.977      3197.548      3199.700      3206.168     
##   N                  1599          1599          1599          1599          1599          1599          1599         
## ======================================================================================================================

The R-squared value for the first model (m1, alcohol only) was about 23%, which is low. R-squared rose with each added input variable and levelled off at about 36% for m6 and m7, so the full model explains roughly 36% of the variance in quality. Alcohol and volatile acidity together (m2) already account for about 32%; sulphates, chlorides and pH each contribute a little more. Citric acid and density add very little: R-squared barely changes from m5 to m7, the density coefficient is not significant, and AIC rises slightly from m6 to m7.
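
The comparison of nested models can be reproduced in outline with base R. This is a sketch under stated assumptions: the data are simulated (quality is generated from alcohol and volatile acidity plus noise), only three of the seven models are built, and the summary table is assembled by hand rather than with whatever helper produced the table above.

```r
# Simulated stand-in for `wine` (NOT the real data)
set.seed(7)
n <- 300
wine <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  volatile.acidity = runif(n, 0.1, 1.2),
  sulphates        = runif(n, 0.3, 1.5)
)
wine$quality <- 2 + 0.35 * wine$alcohol - 1.2 * wine$volatile.acidity +
  rnorm(n, sd = 0.7)

# Build nested models by adding one predictor at a time
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + log10(sulphates))

# Collect the fit statistics discussed in the text
models <- list(m1 = m1, m2 = m2, m3 = m3)
comparison <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
print(round(comparison, 3))

# F-tests for whether each added term improves the fit
print(anova(m1, m2, m3))
```

Plain R-squared can never decrease when a predictor is added, so adjusted R-squared and AIC, which penalise extra terms, are the more informative columns when deciding whether a variable earns its place.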

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Excellent wines are those rated 7 or 8. From the multivariate plots, higher citric acid, sulphates and alcohol tend to go with higher wine quality. Citric acid and sulphates strengthened each other when looking at wine quality, since both are positively associated with it. Likewise, excellent wines tend to combine high citric acid with high alcohol. Citric acid and alcohol have a positive relationship. Citric acid is also strongly correlated with fixed acidity.

Were there any interesting or surprising interactions between features?

The most interesting interactions involved volatile acidity and density. I never really thought that these would play a role in wine quality. However, the plots above suggest that volatile acidity and density do relate to wine quality. Another surprising feature is sulphates. When I started the univariate analysis, I did not expect sulphates to affect quality, but the multivariate analysis proved me wrong.

OPTIONAL: Did you create any models with your dataset?

Yes, I created a linear regression model. Given the dataset, it appears to be rather difficult to predict wine quality: the R-squared value is low (about 0.36 for the largest model). The most important predictor is alcohol. The model is still useful because it shows the relative importance of each input variable. The one limitation I see is the lack of variation in the dataset.

Final Plots and Summary

In this analysis, I tried to understand how the quality of red wine is determined by the input variables. I created many plots to see whether any of the features affected red wine quality. I’ll share the three plots that stood out to me. They are derived from the analysis above.

Plot One

I started my investigation of the red wine data with quality. I wanted to see the range of quality scores present in this dataset. The quality scale runs from 0 (very bad) to 10 (very excellent). The plot showed me where most wines fell within that range.

Description One

In this plot, I wanted to see the different levels of red wine quality. I was surprised to see that a large majority of wines scored either 5 or 6. As a result, I grouped them into three ratings, namely “Poor”, “Good” and “Excellent”. Most wines were “Good”, followed by “Excellent”, and there were very few “Poor” wines. The plot also shows that no wine has a quality score of 0, 1, 2, 9 or 10. This led me to ask which variables drive the different quality levels.

Plot Two

Among the bivariate plots, alcohol had the strongest correlation with red wine quality of all the input variables. It was clear that alcohol plays an important role in wine quality.

Statistical Box Plot Distribution of Alcohol by Quality

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
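
Grouped summaries of this shape are what `by()` combined with `summary` prints in base R. The sketch below assumes that approach and uses simulated values in place of the real `wine` data.

```r
# Simulated stand-in for `wine` (NOT the real data)
set.seed(3)
wine <- data.frame(quality = sample(3:8, 120, replace = TRUE))
wine$alcohol <- 8 + 0.5 * wine$quality + rnorm(120, sd = 0.8)

# Min, quartiles, median, mean and max of alcohol within each quality score
print(by(wine$alcohol, wine$quality, summary))

# The same quartiles in one compact table (rows = quality scores)
quartiles <- t(sapply(split(wine$alcohol, wine$quality), quantile,
                      probs = c(0.25, 0.5, 0.75)))
print(round(quartiles, 2))
```

The compact table makes it easier to compare the quartiles of adjacent quality scores side by side.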

Statistical Relationship between Alcohol and Quality

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
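
The "data:" line in the output above matches a call of the form `cor.test(wine$quality, wine$alcohol)`. A minimal sketch of that call, run on simulated data rather than the real `wine` data frame, looks like this:

```r
# Simulated stand-in for `wine` (NOT the real data)
set.seed(11)
wine <- data.frame(alcohol = runif(200, 8.5, 14))
wine$quality <- round(2 + 0.35 * wine$alcohol + rnorm(200, sd = 0.8))

# Pearson correlation with a significance test and confidence interval
result <- cor.test(wine$quality, wine$alcohol, method = "pearson")
print(result)

# Pull out the pieces quoted in the text
r  <- unname(result$estimate)
ci <- result$conf.int
cat(sprintf("r = %.2f, 95%% CI [%.2f, %.2f]\n", r, ci[1], ci[2]))
```

The t statistic follows from r and the sample size as t = r * sqrt(n - 2) / sqrt(1 - r^2). With r = 0.476 and n = 1599 this gives about 21.64, in line with the output above.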

Description Two

Alcohol was the variable most strongly related to wine quality. As the alcohol content increases, the quality of the wine tends to increase as well. Some key features of this plot are that the median (12.15) of quality 8 is greater than the upper quartile (12.10) of quality 7, and the lower quartile (11.32) of quality 8 is greater than the upper quartile (11.30) of quality 6. These features show a tendency for high quality wines to have a high alcohol content. The moderate positive correlation (0.48) between alcohol and wine quality confirms this trend.

Plot Three

In the final plot, I wanted to show how the two variables most strongly correlated with quality, alcohol and volatile acidity, together relate to red wine quality.

Description Three

Excellent quality red wines seem to have low volatile acidity and high alcohol content. This makes sense. Too much volatile acid, such as acetic acid, makes the wine taste like vinegar or furniture polish and, in excess, undrinkable. The lower the volatile acidity, the better the wine quality.

Alcohol is the product of the fermentation of natural grape sugars by yeast, and without it wine simply doesn’t exist. Grapes mostly contain water and sugar. During fermentation, the yeast consumes the sugar and converts it into alcohol, which is how grape juice turns into wine. In this dataset, high quality red wines generally have a high alcohol content. Many red wines are high in alcohol. For example, Zinfandel, Shiraz and Madeira are high alcohol wines.

Reflection

In this analysis, there were some difficulties. Most of the correlation coefficients showed either weak or negligible relationships. This was an indication to me that perhaps the dataset is too small or that some variables are missing. Based on my wine research, many other variables contribute to wine quality. These include, but are not limited to, temperature, tannin levels, speed, oxygen levels, glycerol levels, grape quality and climate. Even though alcohol was the variable most related to wine quality, the statistics reveal only a moderate positive correlation. Hence, it felt like this dataset did not have the variables needed to explain red wine quality.

The next struggle I had was with the quality variable itself. I had categorized wine quality as “Poor”, “Good” or “Excellent”. Most wines, 1319 of the 1599 in the dataset, fell into the “Good” rating, so that category was heavily overrepresented. In addition, not a single red wine had a quality of 0, 1, 2, 9 or 10. It is not clear to me why this was the case.

The next hardship was that I was neither a wine drinker nor familiar with the wine tasting and wine making process. So I decided to read many articles, papers and blog posts about wine making. I must say I learned a lot about red wine composition, fermentation, wine traits and vinification, which helped me tremendously in analyzing this data set. This learning process took me a while before I actually started my exploratory data analysis project.

Another difficulty I had was using R for exploratory data analysis. I had never used R for data analysis before. Hence, I went through each video and lesson in Udacity’s Explore and Summarize Data to educate myself in R statistical programming. I also used the links in the instructor’s notes to learn more about R. It has been a learning curve in action for me. I learned about the different R packages, R Markdown files, R scripts, how to analyze one, two and multiple variables, transforming data, ggplots, scatter plots, correlations and even linear regression models. This was a steep learning curve, but I truly enjoyed R.

When I started this project, I just wanted to focus on a few variables for the univariate plots. But then my curiosity led me to investigate all the other variables in bivariate and multivariate plots. After plotting the bivariate and multivariate plots, I decided to go back to the univariate plots and plot the rest of the input variables. I wanted to really see what actually affected the wine quality. The discovery was amazing. I found that the factors most strongly related to wine quality were alcohol and volatile acidity.

First, I noticed that some wines didn’t have citric acid at all. The most common value was 0.00 (132 wines). I thought that something was definitely not right with the dataset. I decided to do some research on wine and its relationship with citric acid. From my research, citric acid is actually added to some wines to increase the acidity. This is because citric acid adds ‘freshness’ and flavor to wines. So it made sense to me that some wines would not have any citric acid at all because none was added.

In my analysis, volatile acidity had an inverse relationship with wine quality. This was unexpected. I thought all acids played an important role in the taste of wine and hence increased the wine quality. But that is not necessarily the case with volatile acids. Acetic acid is the most common volatile acid. It is what gives wine a sour vinegar taste. From my wine research, when large quantities of acetic acid bacteria are present, the wine is considered spoiled. Keeping volatile acidity to a minimum is important in the wine making process. The lower the volatile acidity, the better the quality of the wine. One of the ways to reduce volatile acids is to add sulfur dioxide to keep harmful bacteria in check. This brings us to the next variable, sulphates.

I was surprised that sulphates had an impact on wine quality. I did not anticipate this when I was plotting the graphs. Sulphate is a wine additive which can contribute to sulfur dioxide gas levels. Sulfur dioxide acts both as an antimicrobial that prevents microbial growth in wine and as an antioxidant that prevents oxidation of the wine. Most wineries are likely to add sulfur to the macerated grapes and/or must. It protects the must from bacteria and mold that might have been transmitted to the grape clusters either in the vineyard or on the way to the winery. Sulfur is not added during fermentation. When the wine has fermented as much as it will, sulfur is then added to protect the wine through aging. The wine’s pH and alcohol level will contribute to how much sulfur is added prior to bottling. This was an interesting revelation for me.

In the bivariate and multivariate analysis, alcohol content was, hands down, the variable most strongly related to the quality of the red wine. The sweet, sugary juice of the crushed grapes (the must) is transformed into alcohol through fermentation. This is key in wine. Without it there is simply no wine. In this dataset, higher quality wines tend to have a greater alcohol content. This is evident in the bivariate and multivariate analysis, which showed that “Excellent” wines had high alcohol content.

For future analysis, I would love to have a dataset with more input variables that reflect the quality of wine. For example, variables such as temperature, tannin levels, speed, oxygen levels, glycerol levels, grape quality, climate and price would perhaps add more depth to the analysis. Another possible next step is to apply machine learning to provide more accurate predictions of red wine quality.

In conclusion, this course and the red wine analysis were a positive experience. I truly enjoyed the challenges they offered and solving them. R is a great tool for visualization and data exploration. I feel more confident in R than I ever was before.